Text Classification By Bootstrapping With Keywords, EM And Shrinkage

Authors

  • Andrew McCallum
  • Kamal Nigam
Abstract

When applying text classification to complex tasks, it is tedious and expensive to hand-label the large amounts of training data necessary for good performance. This paper presents an alternative approach to text classification that requires no labeled documents; instead, it uses a small set of keywords per class, a class hierarchy and a large quantity of easily obtained unlabeled documents. The keywords are used to assign approximate labels to the unlabeled documents by term-matching. These preliminary labels become the starting point for a bootstrapping process that learns a naive Bayes classifier using Expectation-Maximization and hierarchical shrinkage. When classifying a complex data set of computer science research papers into a 70-leaf topic hierarchy, the keywords alone provide 45% accuracy. The classifier learned by bootstrapping reaches 66% accuracy, a level close to human agreement.

1 Introduction

When provided with enough labeled training examples, a variety of text classification algorithms can learn reasonably accurate classifiers (Lewis, 1998; Joachims, 1998; Yang, 1999; Cohen and Singer, 1996). However, when applied to complex domains with many classes, these algorithms often require extremely large training sets to provide useful classification accuracy. Creating these sets of labeled data is tedious and expensive, since typically they must be labeled by a person. This leads us to consider learning algorithms that do not require such large amounts of labeled data.

While labeled data is difficult to obtain, unlabeled data is readily available and plentiful. Castelli and Cover (1996) show in a theoretical framework that unlabeled data can indeed be used to improve classification, although it is exponentially less valuable than labeled data. Fortunately, unlabeled data can often be obtained by completely automated methods. Consider the problem of classifying news articles: a short Perl script and a night of automated Internet downloads can fill a hard disk with unlabeled examples of news articles. In contrast, it might take several days of human effort and tedium to label even one thousand of these.

Previous work (Nigam et al., 1999) has shown that with just a small number of labeled documents, text classification error can be reduced by up to 30% when the labeled documents are augmented with a large collection of unlabeled documents. This paper considers the task of learning text classifiers with no labeled documents at all. Knowledge about the classes of interest is provided in the form of a few keywords per class and a class hierarchy.
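
The bootstrapping idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it omits hierarchical shrinkage entirely, and the function names, the toy documents, the keyword lists, the Laplace smoothing, and the choice to give documents without a keyword match uniform initial class weights are all assumptions made for illustration.

```python
import math
from collections import Counter

def keyword_label(tokens, keywords):
    """Return the first class with a keyword present in the document, else None."""
    for cls, words in keywords.items():
        if any(w in tokens for w in words):
            return cls
    return None

def m_step(docs, resp, classes, vocab):
    """Estimate Laplace-smoothed naive Bayes parameters from soft labels."""
    prior, word_prob = {}, {}
    for c in classes:
        prior[c] = (1 + sum(r[c] for r in resp)) / (len(classes) + len(docs))
        counts = Counter()
        for tokens, r in zip(docs, resp):
            for t in tokens:
                counts[t] += r[c]  # fractional word counts
        denom = len(vocab) + sum(counts.values())
        word_prob[c] = {w: (1 + counts[w]) / denom for w in vocab}
    return prior, word_prob

def e_step(docs, prior, word_prob, classes):
    """Recompute class membership probabilities for every document."""
    resp = []
    for tokens in docs:
        logp = {c: math.log(prior[c]) + sum(math.log(word_prob[c][t]) for t in tokens)
                for c in classes}
        top = max(logp.values())  # subtract the max for numerical stability
        norm = sum(math.exp(v - top) for v in logp.values())
        resp.append({c: math.exp(logp[c] - top) / norm for c in classes})
    return resp

def bootstrap(docs, keywords, iterations=5):
    """Keyword matching gives initial labels; EM then refines them."""
    classes = list(keywords)
    vocab = {t for d in docs for t in d}
    resp = []
    for d in docs:
        label = keyword_label(d, keywords)
        if label is None:
            # Assumption of this sketch: unmatched documents start uniform.
            resp.append({c: 1.0 / len(classes) for c in classes})
        else:
            resp.append({c: 1.0 if c == label else 0.0 for c in classes})
    for _ in range(iterations):
        prior, word_prob = m_step(docs, resp, classes, vocab)
        resp = e_step(docs, prior, word_prob, classes)
    return [max(r, key=r.get) for r in resp]

# Hypothetical toy data: two classes, one keyword each.
docs = [
    ["learning", "reinforcement", "agent"],
    ["agent", "reward", "policy"],
    ["compiler", "parsing", "grammar"],
    ["grammar", "syntax", "tokens"],
]
keywords = {"ml": ["reinforcement"], "pl": ["compiler"]}
print(bootstrap(docs, keywords))
```

In this toy run only the first and third documents contain a keyword; the other two receive their labels through words they share with the keyword-matched documents, which is the effect the bootstrapping step is meant to capture.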

Publication date: 1999